Suppose you conduct a study where you enroll 200 patients with mild dementia symptoms, then
randomize them so that 100 receive an experimental drug intended for mild dementia symptoms, and
100 receive a placebo. You have the participants take their assigned product for six weeks, then you
record whether each participant felt that the product helped their dementia symptoms. You tabulate the
results in a fourfold table, like Figure 13-5.
FIGURE 13-5: Comparing a treatment to a placebo.
According to the data in Figure 13-5, 70 percent of participants taking the new drug report that it
helped their dementia symptoms, which is quite impressive until you see that 50 percent of participants
who received the placebo also reported improvement. When patients report a therapeutic effect from a placebo, it's called the placebo effect, and it can come from many different sources, including the patient's expectation that the product will work. Nevertheless, if you conduct a Yates chi-square or
Fisher Exact test on the data (as described in Chapter 12) at α = 0.05, the results show treatment
assignment was statistically significantly associated with whether or not the participant reported a
treatment effect (p < 0.05 by either test).
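If you want to check that comparison yourself, here is a minimal sketch (not from the book) using Python's SciPy library, with the Figure 13-5 counts reconstructed from the percentages quoted above (70 of 100 improved on the drug, 50 of 100 on the placebo):

```python
# Minimal sketch: Yates-corrected chi-square and Fisher Exact test on the
# fourfold table implied by Figure 13-5 (counts reconstructed from the text).
from scipy.stats import chi2_contingency, fisher_exact

table = [[70, 30],   # drug:    improved, did not improve
         [50, 50]]   # placebo: improved, did not improve

# correction=True applies the Yates continuity correction to this 2x2 table
chi2, p_yates, dof, expected = chi2_contingency(table, correction=True)
odds_ratio, p_fisher = fisher_exact(table)

print(f"Yates chi-square: chi2 = {chi2:.2f}, p = {p_yates:.4f}")
print(f"Fisher Exact test: p = {p_fisher:.4f}")
```

Both p values come out well under 0.05, which is consistent with declaring the association statistically significant at α = 0.05.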
Looking at inter- and intra-rater reliability
Many measurements in epidemiologic research are obtained by the subjective judgment of humans.
Examples include the human interpretation of X-rays, CAT scans, ECG tracings, ultrasound images,
biopsy specimens, and audio and video recordings of the behavior of study participants in various
situations. Human researchers may generate quantitative measurements, such as determining the length
of a bone on an ultrasound image. Human researchers may also generate classifications, such as
determining the presence or absence of some atypical feature on an ECG tracing.
Humans who perform such determinations in studies are called raters because they are assigning
ratings, which are values or classifiers that will be used in the study. For the measurements in your
study, it is important to know how consistent such ratings are among different raters engaged in rating
the same item. This is called inter-rater reliability. You will also be concerned with how
reproducible the ratings are if one rater were to rate the same item multiple times. This is called intra-
rater reliability.
When considering the consistency of a binary rating (like yes or no) for the same item between two
raters, you can estimate inter-rater reliability by having each rater rate the same group of items.
Imagine you have two raters each rate the same 50 scans as yes or no, according to whether each scan shows a tumor. The results are cross-tabulated in Figure 13-6.
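Figure 13-6's counts aren't reproduced here, but the following is a minimal sketch (not from the book, with made-up counts standing in for that cross-tab) of one common way to summarize agreement between two raters on a binary rating: the observed proportion of agreement, and Cohen's kappa, which adjusts that proportion for the agreement you'd expect by chance alone.

```python
# Minimal sketch: inter-rater agreement for a binary rating from a 2x2 cross-tab
# of two raters' calls on the same 50 scans. The counts below are hypothetical,
# standing in for Figure 13-6, which isn't reproduced here.

# Rows = Rater A (yes, no); columns = Rater B (yes, no)
a, b = 20, 5    # A yes / B yes,  A yes / B no   (hypothetical counts)
c, d = 3, 22    # A no  / B yes,  A no  / B no   (hypothetical counts)
n = a + b + c + d                       # total items rated (50 scans here)

observed = (a + d) / n                  # observed proportion of agreement
# Agreement expected by chance, from each rater's marginal yes/no proportions
expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
kappa = (observed - expected) / (1 - expected)   # Cohen's kappa

print(f"Observed agreement: {observed:.2f}")
print(f"Cohen's kappa:      {kappa:.2f}")
```

With these made-up counts, the raters agree on 84 percent of the scans, and kappa comes out to 0.68 once chance agreement is taken into account.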